OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A study of style effects on OCR errors in the MEDLINE database

Internal identifier: 001421 (Main/Exploration); previous: 001420; next: 001422

Authors: Penny Garrison [United States]; Diane Davis [United States]; Tim Andersen [United States]; Elisa Barney Smith [United States]

RBID : Pascal:05-0359370

Abstract

The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process exhibits good performance overall, it does make errors in character recognition that must be corrected in order for the process to achieve the requisite accuracy. The correction process works by feeding words that have characters with less than 100% confidence (as determined automatically by the OCR engine) to a human operator who then must manually verify the word or correct the error. The majority of these errors are contained in the affiliation information zone where the characters are in italics or small fonts. Therefore only affiliation information data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, bold, etc. and OCR confidence levels. The motivation for this research is that if a correlation between the character style and types of errors exists it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. We have determined that this correlation exists, in particular for the case of characters with diacritics.
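
The correction step described in the abstract (any word containing a character below 100% confidence goes to a human operator) can be illustrated with a minimal sketch. This is not the NLM implementation: the `OcrWord` type, the `char_confidences` field, the 0.0 to 1.0 confidence scale, and the sample words are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OcrWord:
    """Hypothetical container for one recognized word."""
    text: str
    # One confidence value per character, assumed to lie in [0.0, 1.0].
    char_confidences: List[float]


def needs_review(word: OcrWord, threshold: float = 1.0) -> bool:
    """True if any character falls below the threshold (100% by default)."""
    return any(c < threshold for c in word.char_confidences)


def split_for_review(words: List[OcrWord]) -> Tuple[List[OcrWord], List[OcrWord]]:
    """Separate words accepted automatically from words sent to the operator."""
    accepted, review = [], []
    for word in words:
        (review if needs_review(word) else accepted).append(word)
    return accepted, review


# Invented sample data: the last character of the second word is uncertain.
words = [
    OcrWord("Boise", [1.0] * 5),
    OcrWord("Universite", [1.0] * 9 + [0.62]),
]
accepted, review = split_for_review(words)
```

With a threshold of 100%, a single uncertain character is enough to route the whole word to the operator, which is why reducing the number of wrong suggestions matters for productivity.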


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A study of style effects on OCR errors in the MEDLINE database</title>
<author>
<name sortKey="Garrison, Penny" sort="Garrison, Penny" uniqKey="Garrison P" first="Penny" last="Garrison">Penny Garrison</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Davis, Diane" sort="Davis, Diane" uniqKey="Davis D" first="Diane" last="Davis">Diane Davis</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Andersen, Tim" sort="Andersen, Tim" uniqKey="Andersen T" first="Tim" last="Andersen">Tim Andersen</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Smith, Elisa Barney" sort="Smith, Elisa Barney" uniqKey="Smith E" first="Elisa Barney" last="Smith">Elisa Barney Smith</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">05-0359370</idno>
<date when="2005">2005</date>
<idno type="stanalyst">PASCAL 05-0359370 INIST</idno>
<idno type="RBID">Pascal:05-0359370</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000464</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000324</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000437</idno>
<idno type="wicri:doubleKey">1017-2653:2005:Garrison P:a:study:of</idno>
<idno type="wicri:Area/Main/Merge">001467</idno>
<idno type="wicri:Area/Main/Curation">001421</idno>
<idno type="wicri:Area/Main/Exploration">001421</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">A study of style effects on OCR errors in the MEDLINE database</title>
<author>
<name sortKey="Garrison, Penny" sort="Garrison, Penny" uniqKey="Garrison P" first="Penny" last="Garrison">Penny Garrison</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Davis, Diane" sort="Davis, Diane" uniqKey="Davis D" first="Diane" last="Davis">Diane Davis</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Andersen, Tim" sort="Andersen, Tim" uniqKey="Andersen T" first="Tim" last="Andersen">Tim Andersen</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Smith, Elisa Barney" sort="Smith, Elisa Barney" uniqKey="Smith E" first="Elisa Barney" last="Smith">Elisa Barney Smith</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>College of Engineering, Boise State University</s1>
<s2>Boise, Idaho</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Idaho</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="2005">2005</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Accuracy</term>
<term>Automatic recognition</term>
<term>Automatic system</term>
<term>Character recognition</term>
<term>Confidence interval</term>
<term>Database</term>
<term>Error correction</term>
<term>Feature extraction</term>
<term>Human operator</term>
<term>Medical application</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Performance evaluation</term>
<term>Productivity</term>
<term>Signal processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
<term>Base donnée</term>
<term>Application médicale</term>
<term>Système automatique</term>
<term>Extraction caractéristique</term>
<term>Reconnaissance automatique</term>
<term>Evaluation performance</term>
<term>Reconnaissance caractère</term>
<term>Précision</term>
<term>Opérateur humain</term>
<term>Correction erreur</term>
<term>Intervalle confiance</term>
<term>Productivité</term>
<term>Traitement signal</term>
<term>Reconnaissance forme</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Base de données</term>
<term>Productivité</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The National Library of Medicine has developed a system for the automatic extraction of data from scanned journal articles to populate the MEDLINE database. Although the 5-engine OCR system used in this process exhibits good performance overall, it does make errors in character recognition that must be corrected in order for the process to achieve the requisite accuracy. The correction process works by feeding words that have characters with less than 100% confidence (as determined automatically by the OCR engine) to a human operator who then must manually verify the word or correct the error. The majority of these errors are contained in the affiliation information zone where the characters are in italics or small fonts. Therefore only affiliation information data is used in this research. This paper examines the correlation between OCR errors and various character attributes in the MEDLINE database, such as font size, italics, bold, etc. and OCR confidence levels. The motivation for this research is that if a correlation between the character style and types of errors exists it should be possible to use this information to improve operator productivity by increasing the probability that the correct word option is presented to the human editor. We have determined that this correlation exists, in particular for the case of characters with diacritics.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Idaho</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Idaho">
<name sortKey="Garrison, Penny" sort="Garrison, Penny" uniqKey="Garrison P" first="Penny" last="Garrison">Penny Garrison</name>
</region>
<name sortKey="Andersen, Tim" sort="Andersen, Tim" uniqKey="Andersen T" first="Tim" last="Andersen">Tim Andersen</name>
<name sortKey="Davis, Diane" sort="Davis, Diane" uniqKey="Davis D" first="Diane" last="Davis">Diane Davis</name>
<name sortKey="Smith, Elisa Barney" sort="Smith, Elisa Barney" uniqKey="Smith E" first="Elisa Barney" last="Smith">Elisa Barney Smith</name>
</country>
</tree>
</affiliations>
</record>
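
As a rough illustration of how the record above might be read programmatically, the following sketch extracts the title and author names with Python's standard `xml.etree.ElementTree`. The record uses the `wicri:` and `inist:` prefixes without declaring them, so the sketch wraps the text in a hypothetical `<wrapper>` element that binds both prefixes to placeholder URIs. Those URIs, the wrapper element, and the abbreviated `RECORD` string are assumptions made for this example, not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record above (one author only), for illustration.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A study of style effects on OCR errors in the MEDLINE database</title>
<author>
<name sortKey="Garrison, Penny" first="Penny" last="Garrison">Penny Garrison</name>
<affiliation wicri:level="2">
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URIs (assumed): the real ones are not given in the record.
PLACEHOLDER_NS = {"wicri": "urn:example:wicri", "inist": "urn:example:inist"}


def parse_record(xml_text: str) -> dict:
    """Return the title and author names found in the record's titleStmt."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    # Wrapping binds the undeclared prefixes so the parser accepts the text.
    root = ET.fromstring(f"<wrapper {decls}>{xml_text}</wrapper>")
    stmt = root.find("./record/TEI/teiHeader/fileDesc/titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [name.text for name in stmt.findall("author/name")],
    }


info = parse_record(RECORD)
print(info["title"])
print(info["authors"])
```

The wrapper approach only works when the input has no XML declaration of its own; output piped from `HfdSelect` may need different handling.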

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001421 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001421 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:05-0359370
   |texte=   A study of style effects on OCR errors in the MEDLINE database
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024